In 2019, there were nearly 8,500 terrorist attacks around the world, killing more than 20,000 people (Source: Global Terrorism Overview). To successfully mitigate and combat terrorism, it is imperative to understand the complex geopolitical dynamics that enable terrorism and terrorist ideologies. This project aims to analyze what one can infer about terrorist ideologies from data about their attacks. If one can identify causal relationships between the characteristics of terrorists, including their ideologies and group structure, and the types of attacks they perpetrate, then perhaps we could craft informed policy to predict and mitigate terrorism.
Before proceeding, it is necessary to define key terms. Terrorism is notoriously difficult to define, and there is little consensus on its definition across industry, academia, and government. We will use data from the Global Terrorism Database (GTD) as the basis for our analysis, and will therefore use the definitions of terrorism and terrorist attacks provided in the dataset's codebook:
A terrorist attack is the threatened or actual use of illegal force and violence by a non-state actor to gain political, economic, religious, or social goals through fear, coercion, or intimidation.
(Source: GTD Codebook)
It is worth analyzing this definition to gain a better understanding of what is and is not terrorism for the purposes of this project. The codebook indicates that terrorist attacks must be intentional; that does not mean the attack is carried out exactly as planned, but rather that there is an intended target, a method by which to inflict harm, and perhaps evidence of planning. Additionally, a terrorist attack must involve violence, or the immediate threat of violence, against people or property. Violence in the codebook means the intention to cause injury and/or irrevocable destruction or kinetic damage. It is worth noting that the perpetrators must be sub-national actors. The database does not include acts of state terrorism, i.e., attacks by persons who are employed by the state and/or are acting on behalf of a state or nation. This criterion does not exclude state-sponsored attacks; it excludes only attacks perpetrated by state actors themselves.
In addition to the definition above, two of the following three criteria must be met for inclusion in the dataset: 1. The act must be aimed at attaining a political, economic, religious, or social goal. In terms of economic goals, the exclusive pursuit of profit does not satisfy this criterion; it must involve the pursuit of more profound, systemic economic change. 2. There must be evidence of an intention to coerce, intimidate, or convey some other message to a larger audience (or audiences) than the immediate victims. It is the act taken as a totality that is considered, irrespective of whether every individual involved in carrying it out was aware of this intention. As long as any of the planners or decision-makers behind the attack intended to coerce, intimidate, or publicize, the intentionality criterion is met. 3. The action must be outside the context of legitimate warfare activities; that is, the act must be outside the parameters permitted by international humanitarian law.
For additional explanation of these criteria, as well as examples, please see the GTD Codebook.
Thankfully, the most tedious part of the data science pipeline has been done for us. Researchers at the National Consortium for the Study of Terrorism and Responses to Terrorism (START) have amalgamated data on global terrorism incidents from 1970-2019 in their Global Terrorism Database. The database, informed by open-source media articles, contains more than 100 structured variables that characterize each attack's location, tactics and weapons, targets, perpetrators, casualties and consequences, and general information such as definitional criteria and links between coordinated attacks. Unstructured variables include summary descriptions of the attacks, more detailed information on the weapons used, and the specific motives of the attackers. The GTD is accessible to individuals and organizations from its website: https://start.umd.edu/gtd/.
While the methodology for collecting data has evolved since the inception of the database in 2006, it is worth describing the hybrid workflow the GTD employs to collect, process, and publish data today. The process starts with a pool of more than two million open-source media reports published each day. The GTD team combines automated and human workflows, leveraging the strengths and mitigating the limitations of each, to produce rich and reliable data. On the automated side, GTD researchers apply boolean keyword filters, natural language processing (NLP), article deduplication, location identification, clustering of similar articles, and machine learning (ML) models that assess article relevance. After the automated process has gathered, filtered, and labeled the articles, a team of analysts triages them to assess source validity, apply the inclusion criteria, and synthesize narratives of single incidents from multiple sources. The incidents are then coded by smaller teams with specific domain expertise.
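To make the automated stage concrete, here is a toy sketch of the kind of boolean keyword filter and deduplication step such a pipeline might apply. The keywords, articles, and filtering rule below are entirely hypothetical illustrations, not the GTD's actual rules.

```python
import re

# Hypothetical relevancy keywords (illustration only, not GTD's real filter)
KEYWORDS = re.compile(r'\b(bomb|attack|explosion|hostage|gunmen)\b', re.I)

articles = [
    {"title": "Bomb damages courthouse", "body": "An explosion damaged ..."},
    {"title": "Local election results", "body": "Voters went to the polls ..."},
    {"title": "Bomb damages courthouse", "body": "An explosion damaged ..."},  # duplicate
]

seen = set()
relevant = []
for art in articles:
    key = art["title"].strip().lower()  # crude deduplication key
    if key in seen:
        continue
    seen.add(key)
    # keep only articles matching the boolean keyword filter
    if KEYWORDS.search(art["title"] + " " + art["body"]):
        relevant.append(art)

print(len(relevant))  # the duplicate and the irrelevant story are dropped
```

In the real pipeline, clustering and ML relevance models replace these crude title-based and keyword-based rules, but the overall shape (filter, deduplicate, then hand off to human analysts) is the same.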
After creating an individual-use account for the GTD, we download the dataset and import it as a dataframe using pandas.
#!pip -q install wordcloud gensim nltk pyLDAvis
import pandas as pd
import numpy as np
gtd_full = pd.read_excel("globalterrorismdb_0221dist.xlsx")
gtd_full.head()
| eventid | iyear | imonth | iday | approxdate | extended | resolution | country | country_txt | region | ... | addnotes | scite1 | scite2 | scite3 | dbsource | INT_LOG | INT_IDEO | INT_MISC | INT_ANY | related | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 197000000001 | 1970 | 7 | 2 | NaN | 0 | NaT | 58 | Dominican Republic | 2 | ... | NaN | NaN | NaN | NaN | PGIS | 0 | 0 | 0 | 0 | NaN |
| 1 | 197000000002 | 1970 | 0 | 0 | NaN | 0 | NaT | 130 | Mexico | 1 | ... | NaN | NaN | NaN | NaN | PGIS | 0 | 1 | 1 | 1 | NaN |
| 2 | 197001000001 | 1970 | 1 | 0 | NaN | 0 | NaT | 160 | Philippines | 5 | ... | NaN | NaN | NaN | NaN | PGIS | -9 | -9 | 1 | 1 | NaN |
| 3 | 197001000002 | 1970 | 1 | 0 | NaN | 0 | NaT | 78 | Greece | 8 | ... | NaN | NaN | NaN | NaN | PGIS | -9 | -9 | 1 | 1 | NaN |
| 4 | 197001000003 | 1970 | 1 | 0 | NaN | 0 | NaT | 101 | Japan | 4 | ... | NaN | NaN | NaN | NaN | PGIS | -9 | -9 | 1 | 1 | NaN |
5 rows × 135 columns
Perhaps the first thing to note is that this dataframe is quite large to be manipulating in a Jupyter Notebook. It contains over 200,000 incidents and 135 columns, and takes up about 100MB of memory. A dataset of this size may not be considered "big data", but it warrants careful consideration of how we analyze it to avoid long wait times and computational inefficiency. First we will "clean" the data by taking the subset of columns we will use in our analysis; the resulting dataframe will be easier to iterate over, operate on, and read. Then we can start our analysis with some simple plots, beginning with the number of terrorist incidents and casualties over time.
#openpyxl is required by pandas to read and write .xlsx files
#!pip install openpyxl
#subset of columns we will be working with in this project
subset = ['iyear', 'imonth', 'iday', 'country', 'country_txt', 'region', 'region_txt', 'provstate', 'city',
'attacktype1', 'attacktype1_txt', 'targtype1','targtype1_txt', 'gname', 'motive',
'weaptype1', 'weaptype1_txt', 'nkill', 'nwound']
#drop all the columns that aren't in the subset list above
gtd = gtd_full[subset]
#export this dataframe as its own excel file so we can load it faster in the future
gtd.to_excel('gtd-mini.xlsx', index = False)
gtd.head()
| iyear | imonth | iday | country | country_txt | region | region_txt | provstate | city | attacktype1 | attacktype1_txt | targtype1 | targtype1_txt | gname | motive | weaptype1 | weaptype1_txt | nkill | nwound | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1970 | 7 | 2 | 58 | Dominican Republic | 2 | Central America & Caribbean | National | Santo Domingo | 1 | Assassination | 14 | Private Citizens & Property | MANO-D | NaN | 13 | Unknown | 1.0 | 0.0 |
| 1 | 1970 | 0 | 0 | 130 | Mexico | 1 | North America | Federal | Mexico city | 6 | Hostage Taking (Kidnapping) | 7 | Government (Diplomatic) | 23rd of September Communist League | NaN | 13 | Unknown | 0.0 | 0.0 |
| 2 | 1970 | 1 | 0 | 160 | Philippines | 5 | Southeast Asia | Tarlac | Unknown | 1 | Assassination | 10 | Journalists & Media | Unknown | NaN | 13 | Unknown | 1.0 | 0.0 |
| 3 | 1970 | 1 | 0 | 78 | Greece | 8 | Western Europe | Attica | Athens | 3 | Bombing/Explosion | 7 | Government (Diplomatic) | Unknown | NaN | 6 | Explosives | NaN | NaN |
| 4 | 1970 | 1 | 0 | 101 | Japan | 4 | East Asia | Fukouka | Fukouka | 7 | Facility/Infrastructure Attack | 7 | Government (Diplomatic) | Unknown | NaN | 8 | Incendiary | NaN | NaN |
gtd = pd.read_excel('gtd-mini.xlsx')
gtd.head()
| iyear | imonth | iday | country | country_txt | region | region_txt | provstate | city | attacktype1 | attacktype1_txt | targtype1 | targtype1_txt | gname | motive | weaptype1 | weaptype1_txt | nkill | nwound | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1970 | 7 | 2 | 58 | Dominican Republic | 2 | Central America & Caribbean | National | Santo Domingo | 1 | Assassination | 14 | Private Citizens & Property | MANO-D | NaN | 13 | Unknown | 1.0 | 0.0 |
| 1 | 1970 | 0 | 0 | 130 | Mexico | 1 | North America | Federal | Mexico city | 6 | Hostage Taking (Kidnapping) | 7 | Government (Diplomatic) | 23rd of September Communist League | NaN | 13 | Unknown | 0.0 | 0.0 |
| 2 | 1970 | 1 | 0 | 160 | Philippines | 5 | Southeast Asia | Tarlac | Unknown | 1 | Assassination | 10 | Journalists & Media | Unknown | NaN | 13 | Unknown | 1.0 | 0.0 |
| 3 | 1970 | 1 | 0 | 78 | Greece | 8 | Western Europe | Attica | Athens | 3 | Bombing/Explosion | 7 | Government (Diplomatic) | Unknown | NaN | 6 | Explosives | NaN | NaN |
| 4 | 1970 | 1 | 0 | 101 | Japan | 4 | East Asia | Fukouka | Fukouka | 7 | Facility/Infrastructure Attack | 7 | Government (Diplomatic) | Unknown | NaN | 8 | Incendiary | NaN | NaN |
The new dataframe is about one-sixth the size of the original, at 15.2MB. The original dataframe contained columns describing sources, validity, detailed text descriptions, and more. With the exception of two unstructured text columns, we have kept only structured data describing the incidents. We break down the columns and what they record in the table below:
| Column Name | Variable Name | Data Type | Description |
|---|---|---|---|
| iyear | Year | interval | This field contains the year in which the incident occurred. In the case of incident(s) occurring over an extended period, the field will record the year when the incident was initiated. |
| imonth | Month | categorical | This field contains the number of the month in which the incident occurred. In the case of incident(s) occurring over an extended period, the field will record the month when the incident was initiated. |
| iday | Day | interval | This field contains the numeric day of the month on which the incident occurred. In the case of incident(s) occurring over an extended period, the field will record the day when the incident was initiated. |
| country, country_txt | Country | categorical | This field identifies the country or location where the incident occurred. Separatist regions, such as Kashmir, Chechnya, South Ossetia, Transnistria, or Republic of Cabinda, are coded as part of the “home” country. |
| region, region_txt | Region | categorical | This field identifies the region in which the incident occurred. The regions are divided into 12 categories, dependent on the country coded for the case: North America, Central America & Caribbean, South America, East Asia, Southeast Asia, South Asia, Central Asia, Western Europe, Eastern Europe, Middle East & North Africa, Sub-Saharan Africa, and Australasia & Oceania. |
| provstate | Province/State | text | This variable records the name (at the time of event) of the 1st order subnational administrative region in which the event occurs. |
| city | City | text | This field contains the name of the city, village, or town in which the incident occurred. If the city, village, or town for an incident is unknown, then this field contains the smallest administrative area below provstate which can be found for the incident (e.g., district). |
| attacktype1, attacktype1_txt | Attack Type | categorical | This field captures the general method of attack and often reflects the broad class of tactics used. It consists of nine categories, which are listed here: Assassination, Hijacking, Kidnapping, Barricade Incident, Bombing/Explosion, Armed Assault, Unarmed Assault, Facility/Infrastructure Attack, Unknown. |
| targtype1, targtype1_txt | Target Type | categorical | The target/victim type field captures the general type of target/victim. When a victim is attacked specifically because of his or her relationship to a particular person, such as a prominent figure, the target type reflects that motive. For example, if a family member of a government official is attacked because of his or her relationship to that individual, the type of target is “government.” This variable consists of 22 categories that can be found in the GTD codebook. |
| gname | Perpetrator Group Name | unstructured text | This field contains the name of the group that carried out the attack. In order to ensure consistency in the usage of group names for the database, the GTD database uses a standardized list of group names that have been established by project staff to serve as a reference for all subsequent entries. In the event that the name of a formal perpetrator group or organization is not reported in source materials, this field may contain relevant information about the generic identity of the perpetrator(s) (e.g., “Protestant Extremists”). Note that these categories do not represent discrete entities. They are not exhaustive or mutually exclusive (e.g., “student radicals” and “left-wing militants” may describe the same people). They also do not characterize the behavior of an entire population or ideological movement. For many attacks, generic identifiers are the only information available about the perpetrators. Because of this they are included in the database to provide context; however, analysis of generic identifiers should be interpreted with caution. |
| motive | Motive | unstructured text | When reports explicitly mention a specific motive for the attack, this motive is recorded in the “Motive” field. This field may also include general information about the political, social, or economic climate at the time of the attack if considered relevant to the motivation underlying the incident. Note: This field is presently only systematically available with incidents occurring after 1997. |
| weaptype1, weaptype1_txt | Weapon Type | categorical | This field records the general type of weapon used in the incident. It consists of the following categories: Biological, Chemical, Radiological, Nuclear, Firearms, Explosives, Fake Weapons, Incendiary, Melee, Vehicle, Sabotage Equipment, Other, and Unknown. |
| nkill | Total Number of Fatalities | ratio | This field stores the number of total confirmed fatalities for the incident. The number includes all victims and attackers who died as a direct result of the incident. Where there is evidence of fatalities, but a figure is not reported or it is too vague to be of use, such as “many” or “some,” this field remains blank. |
| nwound | Total Number of Injured | ratio | This field records the number of confirmed non-fatal injuries to both perpetrators and victims. It follows the conventions of the “Total Number of Fatalities” field described above. |
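Note that imonth and iday use 0 to mark an unknown month or day, as visible in the sample rows above. If we ever need a proper datetime column, one way to handle this, sketched here on a small hypothetical frame rather than the full GTD, is to treat 0 as missing so that incompletely dated incidents become NaT:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame mirroring the GTD date fields, where 0 marks an
# unknown month or day.
df = pd.DataFrame({'iyear': [1970, 1970], 'imonth': [7, 0], 'iday': [2, 0]})

# Treat 0 as missing, rename to the column names pd.to_datetime expects when
# assembling dates, and coerce: rows with an unknown month or day become NaT
# rather than raising on an invalid date.
parts = df[['iyear', 'imonth', 'iday']].replace(0, np.nan)
parts.columns = ['year', 'month', 'day']
df['date'] = pd.to_datetime(parts, errors='coerce')
print(df['date'].tolist())
```

Since our analysis below works at yearly granularity, we keep the raw fields as-is, but this caveat matters for any per-day time-series work.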
With the exception of gname and motive, all of the variables we have included in the dataframe are structured and well-defined. For additional information on each variable and examples of how they are coded, see the GTD codebook. Now that we have our dataframe, we can proceed to some exploratory data analysis.
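As a sanity check on the size figures quoted above, pandas can report a dataframe's in-memory footprint directly via memory_usage. The sketch below uses a synthetic frame, since the exact numbers depend on dtypes and pandas version:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 135 numeric columns vs. a 19-column subset, mirroring
# the shape reduction we applied to the GTD frame.
full = pd.DataFrame(np.zeros((1000, 135)))
mini = full.iloc[:, :19].copy()

full_mb = full.memory_usage(deep=True).sum() / 1e6
mini_mb = mini.memory_usage(deep=True).sum() / 1e6
print(f"full: {full_mb:.1f}MB, subset: {mini_mb:.1f}MB")
```

On the real GTD frame, `gtd.memory_usage(deep=True).sum()` (or `gtd.info(memory_usage='deep')`) gives the figure including string columns.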
Let's begin by examining the correlation between the variables in our dataset using a heatmap. Note that we need to one-hot encode our categorical variables before computing correlations; otherwise the corr method will treat the numerically coded categorical variables as if they were interval variables.
import seaborn as sns
import matplotlib.pyplot as plt
pgtd = gtd.copy()
pgtd = pd.concat([pgtd, pd.get_dummies(pgtd.region_txt, drop_first = False, prefix = 'region')], axis=1)
pgtd = pd.concat([pgtd, pd.get_dummies(pgtd.attacktype1_txt, drop_first = False, prefix = 'attack_type')], axis=1)
pgtd = pd.concat([pgtd, pd.get_dummies(pgtd.targtype1_txt, drop_first = False, prefix = 'target_type')], axis=1)
pgtd = pd.concat([pgtd, pd.get_dummies(pgtd.weaptype1_txt, drop_first = False, prefix = 'weapon_type')], axis=1)
pgtd = pgtd.drop(columns = ['country', 'region', 'region_txt', 'attacktype1_txt', 'attacktype1',
'targtype1_txt', 'targtype1','weaptype1_txt','weaptype1'])
corr = pgtd.corr()
fig = plt.figure(figsize = (20,20), dpi = 150)
ax = sns.heatmap(
corr,
vmin=-1, vmax=1, center=0,
cmap=sns.diverging_palette(20, 220, n=200),
square=True,
#annot=True,
linewidths=0.5
)
ax.set_xticklabels(
ax.get_xticklabels(),
rotation=45,
horizontalalignment='right'
);
The plot above is a correlation matrix showing the correlation coefficient between each pair of variables in the dataset. Unfortunately, it does not seem that many variables are highly correlated. While the visualization is fun, it is also clunky and hard to interpret. Let's examine further by finding all pairs of variables whose correlation coefficient is greater than 0.3 or less than -0.3. By a common, if arbitrary, convention, an absolute coefficient below 0.3 represents weak correlation, between 0.3 and 0.7 moderate correlation, and 0.7 or greater strong correlation.
#Due to the symmetry of the matrix, we only need to iterate over the bottom left triangle.
k = 0
for i in range(len(corr.columns)):
for j in range(i + 1, len(corr.columns)):
#If the correlation coefficient is greater than 0.3 print
if abs(corr.iloc[i,j]) > 0.3:
k += 1
print(str(k)+".", corr.columns[i], corr.columns[j], corr.iloc[i,j])
1. iyear region_Central America & Caribbean -0.33253279811635467
2. iyear region_South America -0.3164820104707977
3. iyear region_Western Europe -0.3240107665572563
4. nkill nwound 0.5374432605321358
5. region_Middle East & North Africa region_South Asia -0.3652109879128187
6. attack_type_Armed Assault attack_type_Bombing/Explosion -0.5273801348933286
7. attack_type_Armed Assault weapon_type_Explosives -0.4996559996982958
8. attack_type_Armed Assault weapon_type_Firearms 0.6388349131961226
9. attack_type_Assassination attack_type_Bombing/Explosion -0.3233224606071381
10. attack_type_Bombing/Explosion weapon_type_Explosives 0.9228123910805127
11. attack_type_Bombing/Explosion weapon_type_Firearms -0.6394077534737226
12. attack_type_Facility/Infrastructure Attack weapon_type_Incendiary 0.7588477245915957
13. attack_type_Hostage Taking (Kidnapping) weapon_type_Unknown 0.3111172330604915
14. attack_type_Unarmed Assault weapon_type_Chemical 0.33355681982406926
15. attack_type_Unknown weapon_type_Unknown 0.7031133667414636
16. weapon_type_Explosives weapon_type_Firearms -0.6884922330769069
17. weapon_type_Explosives weapon_type_Unknown -0.31705097229117324
We have 17 pairs of variables that are at least weakly correlated. Let's see whether any of the relationships are not easily explainable, or are artificial. The first three pairs show a negative correlation between year and three regions in the dataset. This likely points to a decrease in terrorism in Central America & Caribbean, South America, and Western Europe from 1970 through 2019. The fourth pair shows a correlation between fatalities and injuries, which is intuitive; the more people injured in an attack, the more likely there are to be fatalities, and vice-versa. Pairs six through eight indicate that incidents classified as armed assaults typically involve firearms rather than other weapon types; this is obvious. Similarly, the correlations in pairs 10 and 11 show that attacks classified as bombings or explosions correlate very strongly with the weapon type being explosives; once again, this is obvious. Pair 12 shows a strong correlation between incidents classified as attacks on facilities or infrastructure and the use of incendiary weapons. This likely just means that the number of arson cases in the GTD is far greater than the number of attacks on people using incendiary weapons. Pair 13 indicates a weak correlation between incidents classified as kidnappings and incidents where the weapon type was unknown. Perplexingly, pair 14 shows a correlation between incidents classified as unarmed assaults and the use of chemical weapons. This perhaps has to do with how incidents in the GTD are coded, but warrants further investigation. Pair 15 shows a strong correlation between incidents where the weapon was unknown and incidents where the attack classification was unknown; this is likely a reflection of gaps in open-source data. Finally, pairs 5, 16, and 17 are artificial correlations: each compares dummy columns derived from the same categorical variable, which are negatively correlated by construction and therefore meaningless here.
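To see why dummy columns derived from the same categorical variable are "artificially" correlated, consider a tiny synthetic example: an incident coded with one weapon type cannot simultaneously be coded with another, so the indicator columns are mutually exclusive and hence negatively correlated by construction.

```python
import pandas as pd

# Synthetic weapon-type column (made-up values) and its one-hot encoding.
weapons = pd.Series(['Explosives', 'Firearms', 'Explosives', 'Unknown', 'Firearms'])
dummies = pd.get_dummies(weapons, dtype=int)

# Because the indicators are mutually exclusive (each row has exactly one 1),
# every pairwise correlation between them comes out negative.
corr = dummies.corr()
print(corr.loc['Explosives', 'Firearms'])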
Now that we have examined our correlated pairs, let's investigate some of the less obvious correlations: the correlation between year and region, the correlation between incendiary weapons and attacks on infrastructure, and the correlation between unarmed assaults and the use of chemical weapons.
Let's begin by plotting some general information about terrorist attacks over time, such as the number of attacks per year and the number of casualties per year.
from collections import OrderedDict
N_dict = {}; K_dict = {} #N_dict will store year:number of attack pairs, K_dict will store year:number of casualties pairs
for year in gtd.iyear.unique():
N_dict[year] = len(gtd[gtd['iyear'] == year])
K_dict[year] = gtd[gtd['iyear'] == year].nkill.sum()
N_dict = OrderedDict(sorted(N_dict.items()))
K_dict = OrderedDict(sorted(K_dict.items()))
#Set shape of plot
fig = plt.figure(figsize = (15,7.5), dpi = 150)
#Add the plot, plot title, and axes labels
ax = fig.add_subplot(title = 'Terrorist Attacks (1970-2019)', xlabel = 'Year')
#Set x-axis ticks to one per year from 1970-2019
plt.xticks(np.arange(1970, 2021, step=5), rotation = 45)
plt.xlim(1969, 2021)
#Plot Number of Terrorist Attacks per Year
ax.plot(N_dict.keys(),
N_dict.values(),
label = "Number of Terrorist Attacks",
alpha = 0.75)
#Plot Number of Casualties per Year
ax.plot(K_dict.keys(),
K_dict.values(),
label = "Number of Casualties from Terrorist Attacks",
alpha = 0.75)
#plot legend
plt.legend()
#add gridlines to plot
plt.grid(linewidth = 0.1)
Generally, we see trends of increasing and decreasing attacks and casualties over time, with a notable spike followed by a rapid decline around 2014, visible in both the number of attacks per year and the number of casualties per year. The spike coincides with the formation of the Islamic State of Iraq and Syria (ISIS) in 2013 and its declaration of a caliphate in 2014.
Next, let's fit a linear regression of casualties per year against attacks per year, and then recreate the plots above for each of the regions in the dataset:
from sklearn.linear_model import LinearRegression
#For each year, pair the number of attacks with the number of casualties.
#(Keying the dict on the attack count would silently drop years that happen
#to have the same number of attacks, so we key on the year instead.)
KN_dict = {}
for year in gtd.iyear.unique():
    KN_dict[year] = (len(gtd[gtd['iyear'] == year]),
                     gtd[gtd['iyear'] == year].nkill.sum())
attacks, casualties = zip(*sorted(KN_dict.values()))
#We need to reshape our data to fit the data structure expected by the linear regression
X = np.array(attacks).reshape(-1, 1)
y = np.array(casualties).reshape(-1, 1)
#fit linear regression to data
lr = LinearRegression().fit(X, y)
#print the equation of the model
print(
    "Equation of Linear Model:\n"+
    "Y = ", lr.coef_[0][0],"* X ", lr.intercept_[0])
Equation of Linear Model: Y = 2.4155987387383884 * X -606.7020623592907
#Set shape of plot
fig = plt.figure(figsize = (15,7.5), dpi = 150)
#Add the plot, plot title, and axes labels
ax = fig.add_subplot(title = 'Attacks per year vs Casualties per year (1970-2019)', xlabel = 'Number of Attacks', ylabel = 'Casualties')
#Scatter plot of attacks per year vs casualties per year
ax.scatter(attacks,
           casualties,
           label = "Number of Terrorist Attacks",
           alpha = 0.75)
#Plot the linear regression of casualties per attack
ax.plot(X, lr.predict(X), label = "Linear Regression of Casualties per Attack")
#plot legend
plt.legend()
#add gridlines to plot
plt.grid(linewidth = 0.1)
from matplotlib import cm
regions = gtd.region_txt.unique()
#Pick a distinct color for each region (regions must be defined before the colors are computed)
colors = [cm.gist_ncar(int(x*256/len(regions))) for x in range(len(regions))]
#Set shape of plot: 12 subplots
fig, axs = plt.subplots(len(regions), figsize = (16, 64), dpi = 150)
i = 0
for i in range(len(regions)):
N_dict = {}; K_dict = {} #N_dict will store year:number of attack pairs, K_dict will store year:number of casualties pairs
for year in gtd.iyear.unique():
N_dict[year] = len(gtd[(gtd['iyear'] == year) & (gtd['region_txt'] == regions[i])])
K_dict[year] = gtd[(gtd['iyear'] == year) & (gtd['region_txt'] == regions[i])].nkill.sum()
N_dict = OrderedDict(sorted(N_dict.items()))
K_dict = OrderedDict(sorted(K_dict.items()))
#Set x-axis ticks to one per year from 1970-2019
plt.xticks(np.arange(1970, 2021, step=5), rotation = 45)
plt.xlim(1969, 2021)
#Plot Number of Terrorist Attacks per Year
axs[i].plot(N_dict.keys(),
N_dict.values(),
label = str(regions[i])+" Attacks",
color = colors[i])
#alpha = 1)
ax2 = axs[i].twinx()
#Plot Number of Casualties per Year
ax2.plot(K_dict.keys(),
K_dict.values(),
label = str(regions[i])+" Casualties",
color = colors[i],
linestyle = "dashed",
alpha = 0.75)
ax2.set_ylabel('Casualties')
#add gridlines to plot
axs[i].grid(linewidth = 0.1)
#set title and axis labels
axs[i].set_title('Terrorist Attacks and Casualties by Year ('+regions[i]+')')
plt.setp(axs[i], xlabel = 'Year')
plt.setp(axs[i], ylabel = 'Number of Attacks')
plt.setp(axs[i], xticks = np.arange(1970, 2021, step=5))
#plot legend
#plt.legend()
regions = gtd.region_txt.unique()
#Set shape of plot: 12 subplots
fig, axs = plt.subplots(len(regions), figsize = (16, 64), dpi = 150)
i = 0
for i in range(len(regions)):
    #For each year in this region, pair the number of attacks with the number
    #of casualties (keying on year to avoid dropping years with equal attack counts)
    KN_dict = {}
    for year in gtd.iyear.unique():
        KN_dict[year] = (len(gtd[(gtd['iyear'] == year) & (gtd['region_txt'] == regions[i])]),
                         gtd[(gtd['iyear'] == year) & (gtd['region_txt'] == regions[i])].nkill.sum())
    attacks, casualties = zip(*sorted(KN_dict.values()))
    #We need to reshape our data to fit the data structure expected by the linear regression
    X = np.array(attacks).reshape(-1, 1)
    y = np.array(casualties).reshape(-1, 1)
    #fit linear regression to data
    lr = LinearRegression().fit(X, y)
    #Scatter plot of attacks per year vs casualties per year for this region
    axs[i].scatter(attacks,
                   casualties,
                   label = str(regions[i])+" Attacks",
                   color = colors[i],
                   alpha = 0.5)
    #Plot the linear regression of casualties per attack for this region
    axs[i].plot(X, lr.predict(X),
                label = "Linear Regression of Casualties per Attack",
                color = colors[i])
#add gridlines to plot
axs[i].grid(linewidth = 0.1)
#set title and axis labels
axs[i].set_title('Terrorist Casualties per year vs. Attacks per year ('+regions[i]+')')
plt.setp(axs[i], xlabel = 'Number of Attacks per year')
plt.setp(axs[i], ylabel = 'Number of Casualties per year')
#plot legend
#plt.legend()
Let's continue by looking at a violin plot of attacks per year by region.
atk_per_region = []
regions = gtd.region_txt.unique()
for reg in regions:
temp_list = []
for year in gtd.iyear.unique():
temp_list.append(len(gtd[(gtd.region_txt == reg) & (gtd.iyear == year)]))
atk_per_region.append(temp_list)
#Set shape of plot
fig = plt.figure(figsize = (20,10), dpi = 150)
#Add the plot, plot title, and axes labels
ax = fig.add_subplot(title = 'Terrorist Attacks (1970-2019)', xlabel = 'Region')
#Violin plot
ax.violinplot(atk_per_region, widths=0.9, showmeans=True)
ax.set_xticks(np.arange(1,13))
ax.set_xticklabels(regions, rotation = 45)
#plt.yscale("log")
#add gridlines to plot
plt.grid(linewidth = 0.1)
plt.show()
The violin plot summarizes the distribution of annual attack counts within each region, letting us compare typical attack volumes and year-to-year variability at a glance. Next, let's use heatmaps to relate weapon types to target types, casualties, and attack types.
wep_list = gtd.weaptype1_txt.unique()
targ_list = gtd.targtype1_txt.unique()
#each cell: share of incidents using weapon i that struck target type j
a = np.zeros((len(wep_list), len(targ_list)))
for i in range(len(wep_list)):
for j in range(len(targ_list)):
a[i][j] = len(gtd[(gtd['targtype1_txt'] == targ_list[j]) & (gtd['weaptype1_txt'] == wep_list[i])])/len(gtd[(gtd['weaptype1_txt'] == wep_list[i])])
fig = plt.figure(figsize = (12,22), dpi = 150)
ax = fig.add_subplot(title = 'Target vs Weapon Heatmap', ylabel = 'Weapon Type', xlabel = 'Target Type')
wep_list_label = wep_list.copy()
wep_list_label[7] = "Vehicle"
ax.set_yticks(np.arange(len(wep_list)))
ax.set_xticks(np.arange(len(targ_list)))
ax.set_yticklabels(wep_list_label)
ax.set_xticklabels(targ_list, rotation = 90)
ax.imshow(a)
plt.show()
#each cell: share of incidents using weapon i in which exactly j people were killed
#(note the b[j][i] indexing: rows are casualty counts, columns are weapons)
b = np.zeros((6, len(wep_list)))
for i in range(len(wep_list)):
    for j in range(6):
        b[j][i] = len(gtd[(gtd['nkill'] == j) & (gtd['weaptype1_txt'] == wep_list[i])])/len(gtd[(gtd['weaptype1_txt'] == wep_list[i])])
fig = plt.figure(figsize = (12,22), dpi = 150)
ax = fig.add_subplot(title = 'Weapon vs Casualties Heatmap', ylabel = 'Casualties', xlabel = 'Weapon Type')
ax.set_xticks(np.arange(len(wep_list)))
ax.set_yticks(np.arange(6))
ax.set_xticklabels(wep_list_label, rotation = 90)
ax.imshow(b)
plt.show()
atk_list = gtd.attacktype1_txt.unique()
#each cell: count of incidents with attack type j and weapon type i
c = np.zeros((len(atk_list), len(wep_list)))
for j in range(len(atk_list)):
for i in range(len(wep_list)):
c[j][i] = len(gtd[(gtd['attacktype1_txt'] == atk_list[j]) & (gtd['weaptype1_txt'] == wep_list[i])])
fig = plt.figure(figsize = (12,22), dpi = 150)
ax = fig.add_subplot(title = 'Attack vs Weapon Heatmap', xlabel = 'Weapon Type', ylabel = 'Attack Type')
wep_list_label[7] = "Vehicle"
ax.set_xticks(np.arange(len(wep_list)))
ax.set_yticks(np.arange(len(atk_list)))
ax.set_xticklabels(wep_list_label, rotation = 60)
ax.set_yticklabels(atk_list)
ax.imshow(c)
plt.show()
import re
#Drop incidents without a recorded motive (the field is only systematically
#available for incidents occurring after 1997)
print(len(gtd))
gtd = gtd.dropna(subset = ["motive"])
print(len(gtd))
#Strip punctuation and lowercase the motive text
gtd['motive'] = gtd['motive'].map(lambda x: re.sub(r'[,\.!?]', '', x))
gtd['motive'] = gtd['motive'].map(lambda x: x.lower())
gtd['motive'].head()
# Import the wordcloud library
from wordcloud import WordCloud
# Join the different processed titles together.
long_string = ','.join(list(gtd['motive'].values))
# Create a WordCloud object
wordcloud = WordCloud(background_color="white", max_words=5000, contour_width=3, contour_color='steelblue', width = 800, height = 400)
# Generate a word cloud
wordcloud.generate(long_string)
# Visualize the word cloud
wordcloud.to_image()
import gensim
from gensim.utils import simple_preprocess
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['unknown', 'attack', 'claimed', 'responsibility', 'incident', 'sources', 'stated', 'specific', 'however', 'noted', 'motive', 'carried',
'part', 'occurred', 'targeted', 'suspected', 'majority', 'recent', 'attacks', 'larger', 'trend', 'may', 'violence', 'related', 'al',
'also'])
def sent_to_words(sentences):
for sentence in sentences:
# deacc=True removes punctuations
yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))
def remove_stopwords(texts):
return [[word for word in simple_preprocess(str(doc))
if word not in stop_words] for doc in texts]
data = gtd.motive.values.tolist()
data_words = list(sent_to_words(data))
# remove stop words
data_words = remove_stopwords(data_words)
print(data_words[:1])
long_string = ','.join(str(item) for innerlist in data_words for item in innerlist)
# Create a WordCloud object
wordcloud = WordCloud(background_color="white", max_words=5000, contour_width=3, contour_color='steelblue', width = 800, height = 400)
# Generate a word cloud
wordcloud.generate(long_string)
# Visualize the word cloud
wordcloud.to_image()
import gensim.corpora as corpora
# Create Dictionary
id2word = corpora.Dictionary(data_words)
# Create Corpus
texts = data_words
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]
# View
print(corpus[:1][0][:30])
from pprint import pprint
# number of topics
num_topics = 10
# Build LDA model
lda_model = gensim.models.LdaMulticore(corpus=corpus,
id2word=id2word,
num_topics=num_topics)
# Print the Keyword in the 10 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]
import pyLDAvis.gensim_models
import pickle
import pyLDAvis
import os
# Visualize the topics
pyLDAvis.enable_notebook()
LDAvis_prepared = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word)
LDAvis_prepared
topic_list = []
for i in range(len(corpus)):
tmp = lda_model.get_document_topics(corpus[i])
max_num = 0
max_j = 0
#print(tmp, end = '')
for j in range(len(tmp)):
if tmp[j][1] > max_num:
max_num = tmp[j][1]
max_j = j
topic_list.append(tmp[max_j][0])
#print(i, tmp[max_j])
gtd["motivation_topic"] = topic_list
gtd.head()
targ_list = gtd.targtype1_txt.unique()
#each cell: share of incidents with dominant motive topic j that struck target type i
d = np.zeros((len(targ_list), 10))
for i in range(len(targ_list)):
for j in range(10):
d[i][j] = len(gtd[(gtd['targtype1_txt'] == targ_list[i]) & (gtd['motivation_topic'] == j)])/len(gtd[(gtd['motivation_topic'] == j)])
fig = plt.figure(figsize = (20,11), dpi = 150)
ax = fig.add_subplot(title = 'Motivation vs Target Heatmap', xlabel = 'Motivation Topic', ylabel = 'Target Type')
ax.set_yticks(np.arange(len(targ_list)))
ax.set_xticks(np.arange(10))
ax.set_yticklabels(targ_list)
ax.imshow(d)
plt.show()
#each cell: count of incidents using weapon i with dominant motive topic j
e = np.zeros((len(wep_list), 10))
for i in range(len(wep_list)):
    for j in range(10):
        e[i][j] = len(gtd[(gtd['weaptype1_txt'] == wep_list[i]) & (gtd['motivation_topic'] == j)])
fig = plt.figure(figsize = (10,5.5), dpi = 150)
ax = fig.add_subplot(title = 'Motivation vs Weapon Selection Heatmap', xlabel = 'Motivation Topic', ylabel = 'Weapon Type')
ax.set_yticks(np.arange(len(wep_list)))
ax.set_xticks(np.arange(10))
ax.set_yticklabels(wep_list_label)
ax.imshow(e)
plt.show()